What are appropriate paradigms for supporting fault-tolerant applications and how can they be
implemented  efficiently? 

To  what  extent  can  fault  tolerance  be  retrofitted  into  existing  applications  automatically? 

What  lessons  can  be  learned  from  existing  implementations  of  fault-tolerant  and  distributed 
systems?
1 Introduction

Optimistic  rollback  recovery  methods  can  efficiently  and  transparently  provide  fault  tolerance  for  applications 
executing  in  a  distributed  system.  With  rollback  recovery,  information  saved  on  stable  storage  during  failure- 
free  execution  allows  certain  states  of  each  process  to  be  recovered  after  a  failure.  Examples  of  such  methods 
include those using message logging and checkpointing [11, 7, 3, 13, 8, 12], and those using checkpointing
alone  [10,  4,  2].  Optimistic  methods  in  general  allow  unrecoverable  states  of  one  process  to  be  seen  by  other 
processes,  and  optimistically  assume  that  these  states  will  become  recoverable  before  a  failure  occurs.  This 
allows  the  needed  recovery  information  to  be  saved  on  stable  storage  asynchronously,  reducing  failure-free 
overhead.  However,  if  after  a  failure,  these  states  are  not  recoverable,  processes  other  than  those  that  failed 
may  also  be  required  to  roll  back  in  order  to  restore  the  system  to  a  consistent  state. 

We  have  developed  a  theoretical  model  for  reasoning  about  optimistic  rollback  recovery  methods  [8,  6], 
and  have  shown  that,  in  any  system  using  optimistic  rollback  recovery,  there  always  exists  a  unique  maximum 
recoverable  system  state.  We  have  also  developed  two  algorithms  for  finding  this  maximum  recoverable 
system  state.  These  results  can  be  applied  to  systems  in  which  all  execution  of  processes  between  received 
messages is assumed to be deterministic (e.g., message logging and checkpointing methods), and to systems in
which  no  such  assumption  is  made  (e.g.,  checkpointing  methods).  We  have  completed  a  full  implementation 
of optimistic message logging and checkpointing on a network of SUN workstations under the V-System, and
performance  measurements  from  it  demonstrate  the  efficiency  of  this  method  [5].  The  overhead  on  individual 
communication  operations  averaged  only  10  percent,  and  the  overhead  on  distributed  application  programs 
ranged  from  a  maximum  of  under  4  percent  to  much  less  than  1  percent. 

This  paper  briefly  describes  the  current  status  of  our  research.  We  also  discuss  some  of  its  limitations 
and present a new algorithm that addresses these limitations [9]. This algorithm dynamically supports both
deterministic  and  nondeterministic  processes,  and  allows  processes  to  individually  switch  between  using 
message  logging  and  checkpointing  or  using  checkpointing  alone. 
2 Current Status


Our  model  concisely  captures  the  dependencies  that  exist  within  the  system  that  result  from  communication 
between processes. The execution of each process is divided into a sequence of state intervals, such that in
terms  of  the  rest  of  the  model,  all  individual  states  of  a  process  within  any  single  state  interval  are  equivalent 
from  the  point  of  view  of  all  other  processes  in  the  system.  The  differences  between  a  deterministic  and  a 
nondeterministic  system  are  limited  to  the  respective  definitions  of  process  state  intervals.  A  state  interval 
is  called  stable  if  and  only  if  some  state  of  the  process  within  that  interval  can  be  recreated  from  information 
on  stable  storage  after  a  failure. 

The current dependencies of a process are represented by a dependency vector, and a system state is
represented  by  a  dependency  matrix.  A  system  state  is  recoverable  if  and  only  if  it  is  consistent  and  each 
individual  process  state  is  stable.  The  process  states  that  make  up  a  system  state  need  not  all  have  existed  at 
the  same  time.  A  system  state  is  said  to  have  occurred  during  some  execution  of  the  system  if  all  component 
process  states  have  each  individually  occurred.  The  system  history  relation  defines  a  partial  order  on  these 
system  states,  such  that  one  system  state  precedes  another  if  and  only  if  it  must  have  occurred  first. 
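
As an illustration of these definitions, the following Python sketch represents a system state as a dependency matrix and tests it for consistency and recoverability. The encoding is an assumption made for this example and is not spelled out above: row i is the dependency vector of process i, entry j of that vector is the highest state interval index of process j on which process i depends (None if there is no such dependency), and entry i is the index of process i's own state interval. The names is_consistent, is_recoverable, and stable are hypothetical.

```python
from typing import List, Optional, Set

# Assumed encoding (hypothetical, for illustration only):
# matrix[i][j] is the highest state interval index of process j on which
# process i depends, or None; matrix[i][i] is process i's own interval index.
DependencyVector = List[Optional[int]]
DependencyMatrix = List[DependencyVector]


def is_consistent(matrix: DependencyMatrix) -> bool:
    """True if no process depends on an interval of process i that is later
    than the interval process i itself occupies in this system state."""
    n = len(matrix)
    for i in range(n):
        own = matrix[i][i]
        for j in range(n):
            dep = matrix[j][i]
            if dep is not None and dep > own:
                return False
    return True


def is_recoverable(matrix: DependencyMatrix, stable: List[Set[int]]) -> bool:
    """True if the state is consistent and every component interval is stable.
    stable[i] is the set of stable state interval indices of process i."""
    return is_consistent(matrix) and all(
        matrix[i][i] in stable[i] for i in range(len(matrix))
    )
```

Under this encoding, consistency reduces to checking that no entry in column i exceeds the diagonal entry of that column.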

With  this  model,  we  have  proven  some  important  properties  of  any  system  using  rollback  recovery.  First, 
the  set  of  system  states  that  have  occurred  during  any  single  execution  of  a  system,  ordered  by  the  system 
history relation, forms a lattice, called the system history lattice, with the sets of consistent and recoverable
system  states  as  sublattices.  During  execution,  there  is  thus  always  a  unique  maximum  recoverable  system 
state,  which  never  decreases.  We  have  also  proven  sufficient  conditions  for  committing  output  from  the 
system  to  the  “outside  world,”  and  for  removing  recovery  information  from  stable  storage  when  no  longer 
needed. 
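
Because the maximum recoverable system state is unique, it can in principle be found by a simple fixed-point search. The sketch below is a naive search written for illustration. It is not a reproduction of the algorithms developed in this work. It assumes the hypothetical encoding in which deps[p][k] is the dependency vector of state interval k of process p, stable[p] is the set of stable interval indices of process p, and interval 0 of every process is stable and depends on no other process.

```python
from typing import Dict, List, Optional, Set

# deps[p][k][q]: highest interval index of process q on which interval k of
# process p depends, or None (hypothetical encoding for this sketch).
Deps = List[Dict[int, List[Optional[int]]]]


def max_recoverable_state(deps: Deps, stable: List[Set[int]]) -> List[int]:
    """Start from the latest stable interval of every process and roll
    processes back until no process depends on a discarded interval."""
    n = len(deps)
    current = [max(stable[p]) for p in range(n)]
    changed = True
    while changed:
        changed = False
        for j in range(n):
            for i in range(n):
                if i == j:
                    continue
                dep = deps[j][current[j]][i]
                if dep is not None and dep > current[i]:
                    # Process j depends on an interval of process i that is
                    # not part of the candidate state: roll j back to its
                    # latest earlier stable interval without that dependency.
                    current[j] = max(
                        k for k in stable[j]
                        if k < current[j]
                        and (deps[j][k][i] is None or deps[j][k][i] <= current[i])
                    )
                    changed = True
    return current
```

Each step only lowers a candidate interval, and never below any recoverable system state, so the search terminates at the maximum.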

We  have  developed  two  algorithms  for  determining  this  maximum  recoverable  system  state,  and  have 
used  the  model  to  prove  their  correctness.  The  first  algorithm  finds  the  maximum  recoverable  system  state 
“from  scratch”  each  time  it  is  invoked,  whereas  the  second  algorithm  is  incremental,  beginning  its  search 
with  the  previously  known  maximum  and  utilizing  information  saved  from  its  previous  executions  to  shorten 
its  search.  We  have  completed  an  implementation  of  optimistic  message  logging  and  checkpointing  using  this 
first  algorithm,  running  under  the  V-System.  Our  performance  measurements  from  this  implementation,  on 
a  network  of  SUN-3/60  workstations,  can  be  summarized  as  follows: 

•  The  overhead  on  individual  V-System  communication  operations  averaged  only  10  percent,  ranging 
from  about  18  percent  to  2  percent,  for  different  operations. 

•  During a checkpoint, the execution of the process is suspended typically for only a few tens of milliseconds,
since most data is written to the checkpoint before suspending the process. The total time to
complete the checkpoint, though, is dominated by the time required to write the modified pages of the
user address space to the checkpoint, and is about 3 seconds per megabyte written.

•  The  total  overhead  experienced  by  distributed  application  programs  is  affected  most  by  the  amount 
of  communication  performed  during  execution.  We  measured  the  performance  of  programs  for  solving 
the  n-queens  problem,  the  traveling  salesman  problem,  and  Gaussian  elimination  with  partial  pivoting. 
The  total  overhead  ranged  from  a  maximum  of  under  4  percent  to  much  less  than  1  percent. 

•  During  failure  recovery,  the  running  time  of  the  algorithm  is  negligible  relative  to  the  time  required 
to  restore  the  processes  from  their  checkpoints  and  to  replay  the  logged  messages  to  the  recovering 
processes. 

To  our  knowledge,  this  is  the  only  existing  complete  implementation  of  fault-tolerance  using  optimistic 
message  logging  and  checkpointing. 

3  Limitations  and  Future  Directions 

The  lack  of  support  for  nondeterministic  process  execution  is  a  significant  limitation  to  current  methods 
using message logging and checkpointing. Nondeterministic execution can arise, for example, through asynchronous
scheduling of multiple threads accessing shared memory. To recover the state of a process using
message logging and checkpointing, the sequence of messages originally received by the process after its checkpoint
is replayed to it. The process is assumed to reexecute deterministically based on these messages, and
to reach the same state as it had after receiving them before the failure. If process execution can be nondeterministic,
recovery is not guaranteed to succeed. This limitation does not affect methods using only checkpointing,
since only process states recorded in checkpoints are used for recovery.
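
The replay step described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the V-System implementation: AccumulatorProcess, deliver, checkpoint, and recover are hypothetical names, and the process state is a small dictionary whose value depends only on the order and content of the messages delivered.

```python
import copy
from typing import Dict, List


class AccumulatorProcess:
    """Toy deterministic process: its state is a function of the sequence
    of messages delivered to it."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {"total": 0, "received": 0}

    def deliver(self, message: int) -> None:
        # The update is order sensitive, so replay order matters.
        self.state["total"] = 2 * self.state["total"] + message
        self.state["received"] += 1

    def checkpoint(self) -> Dict[str, int]:
        return copy.deepcopy(self.state)


def recover(checkpoint: Dict[str, int], logged_messages: List[int]) -> AccumulatorProcess:
    """Restore the checkpoint, then replay the logged messages in the order
    in which they were originally received."""
    process = AccumulatorProcess()
    process.state = copy.deepcopy(checkpoint)
    for message in logged_messages:
        process.deliver(message)
    return process
```

If deliver also consulted a source of nondeterminism, such as thread scheduling, replaying the same log would no longer be guaranteed to reach the pre-failure state.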

Another  limitation  of  current  message  logging  and  checkpointing  methods,  which  is  shared  by  methods 
using only checkpointing, is the difficulty of committing output from the system to the “outside world.”
Output  must  be  delayed  until  it  is  guaranteed  that  the  sending  process  will  never  roll  back  beyond  the 
state from which the output was sent. Essentially, this requires that every causal antecedent of that state
be recoverable after a failure. Without coordination between output and the message
logging  or  checkpointing  of  individual  processes,  the  delays  in  committing  output  may  be  substantial.  These 
delays  can  be  reduced  by  logging  or  checkpointing  more  frequently,  but  this  may  significantly  increase  the 
failure-free  overhead  of  the  system. 
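
The output commit condition can be illustrated with a short Python sketch. It rests on two simplifying assumptions that are not stated above: the dependency vector of the sending state interval records transitive dependencies, and stable_through[p] is an index such that every state interval of process p up to and including it is stable. The function names are hypothetical.

```python
from typing import List, Optional


def blocking_processes(
    dependency_vector: List[Optional[int]], stable_through: List[int]
) -> List[int]:
    """Processes on which the sending state depends through an interval
    that is not yet stable. Output must be delayed while this is non-empty."""
    return [
        p
        for p, dep in enumerate(dependency_vector)
        if dep is not None and dep > stable_through[p]
    ]


def can_commit_output(
    dependency_vector: List[Optional[int]], stable_through: List[int]
) -> bool:
    """True if every causal antecedent of the sending state is recoverable."""
    return not blocking_processes(dependency_vector, stable_through)
```

In this simplified view, the list returned by blocking_processes identifies the processes that must save more recovery information before the output can be released.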

These  observations  lead  us  to  a  new  algorithm  [9]  in  which  recording  the  needed  recovery  information  on 
stable  storage,  and  determining  the  current  maximum  recoverable  system  state,  are  both  driven  by  the  need 
to  commit  output  from  the  system  to  the  outside  world.  This  algorithm  allows  all  output  to  the  outside 
world  to  be  committed  quickly  after  being  sent,  while  reducing  the  overhead  required  to  determine  when  such 
output  can  be  committed.  The  algorithm  further  reduces  fault-tolerance  overhead  by  avoiding  the  logging  of 
messages  not  needed  to  allow  pending  output  to  be  committed.  Each  process  commits  its  own  state  intervals 
as  needed,  and  requires  the  cooperation  of  the  minimum  number  of  other  processes. 

The  issue  of  nondeterministic  execution  is  addressed  by  allowing  individual  processes  to  dynamically 
switch  between  using  message  logging  and  checkpointing  or  using  checkpointing  alone.  We  assume  that 
processes  can  detect  when  their  execution  is  nondeterministic,  such  as  through  a  trap  caused  by  the  memory 
protection  hardware.  Processes  can  use  message  logging  during  deterministic  execution,  to  avoid  recording  a 
new  checkpoint  each  time  they  or  other  processes  that  depend  on  them  need  to  commit  output  to  the  outside 
world.  During  nondeterministic  execution,  the  process  converts  to  using  only  checkpointing.  This  feature 
can  also  be  used  by  individual  processes  to  reduce  the  overhead  of  message  logging.  Processes  can  decide 
not  to  log  received  messages  during  arbitrary  periods  of  their  own  execution.  This  saves  the  overhead  of 
copying each received message to a buffer in volatile memory, and the overhead of later writing this buffer
to  stable  storage.  However,  processes  must  then  record  a  new  checkpoint  when  they  or  other  processes  that 
depend  on  them  need  to  commit  output  to  the  outside  world. 

We  contend  that  there  are  significant  advantages  in  allowing  each  process  a  dynamic  choice  between 
message logging and checkpointing. In particular, if one or more processes in the computation are nondeterministic,
they would always use checkpointing, while other processes may choose to use message logging.
Furthermore, if a process is known to be deterministic most of the time, but occasionally experiences detectable
nondeterministic events, it may choose to use message logging during deterministic periods, but
turn  off  message  logging  after  it  has  experienced  a  nondeterministic  event.  After  a  subsequent  checkpoint, 
message  logging  may  be  turned  on  again  until  another  nondeterministic  event  occurs.  Deterministic  processes 
can choose between message logging and checkpointing depending on which they perceive to be less
expensive at any particular time. If a process receives a large number of messages during some period of
time,  it  may  choose  to  record  a  new  checkpoint,  thereby  eliminating  the  need  for  writing  these  messages  to 
stable  storage.  If,  on  the  other  hand,  a  process  receives  few  messages,  but  has  a  large  and  rapidly  changing 
address space, it may instead decide to log these few messages and postpone taking an expensive checkpoint
until one becomes necessary to limit the recovery time. This algorithm can be viewed as unifying
the  spectrum  of  methods  between  checkpointing  alone  and  message  logging  and  checkpointing. 
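
The per-process switching described above can be pictured as a small state machine. The following Python sketch is only an illustration under stated assumptions and is not the algorithm of [9]: the class RecoveryPolicy, its method names, and the in-memory lists standing in for the volatile buffer and stable storage are all hypothetical.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    MESSAGE_LOGGING = "message logging and checkpointing"
    CHECKPOINT_ONLY = "checkpointing alone"


class RecoveryPolicy:
    """Tracks which recovery method one process is currently using."""

    def __init__(self) -> None:
        self.mode = Mode.MESSAGE_LOGGING
        self.volatile_log: List[str] = []  # received messages not yet saved
        self.stable_log: List[str] = []    # stand-in for stable storage
        self.checkpoints_taken = 0

    def on_receive(self, message: str) -> None:
        # Messages are buffered only while the process is logging.
        if self.mode is Mode.MESSAGE_LOGGING:
            self.volatile_log.append(message)

    def on_nondeterministic_event(self) -> None:
        # Replay can no longer reproduce later states, so stop logging.
        self.mode = Mode.CHECKPOINT_ONLY

    def take_checkpoint(self) -> None:
        self.checkpoints_taken += 1
        # Simplification of this sketch: buffered messages precede the new
        # checkpoint and are dropped rather than written out.
        self.volatile_log.clear()
        # After a checkpoint, logging may be turned on again.
        self.mode = Mode.MESSAGE_LOGGING

    def on_commit_needed(self) -> str:
        """Called when this process, or one that depends on it, must commit
        output to the outside world."""
        if self.mode is Mode.MESSAGE_LOGGING:
            self.stable_log.extend(self.volatile_log)
            self.volatile_log.clear()
            return "flushed log"
        self.take_checkpoint()
        return "checkpointed"
```

A deterministic process could also call take_checkpoint voluntarily when it judges the buffered messages more expensive to write than its modified pages. That cost comparison is left out of the sketch.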

We are also examining methods for exploiting limited application-specific knowledge to reduce the overhead
of message logging, while still being transparent to the application. Our current work on this
is set in the environment of a distributed shared memory system. In such a system, all messages between
processes are generated by the shared memory system. Fault tolerance could be provided by simply logging
these messages, but we believe it would be far more efficient to take advantage of the knowledge that
these messages are sent to emulate a specific shared data structure. In a separate project, we are currently
developing  a  distributed  shared  memory  system  using  memory  coherence  mechanisms  that  are  specific  to 
the particular access pattern of each object [1]. As a simple example, if the shared memory system knows
that  a  particular  object  is  “read-only,”  accesses  to  it  need  not  be  logged.  Wu  and  Fuchs  [14]  have  recently 
proposed  a  pessimistic  method  for  providing  fault  tolerance  in  a  distributed  shared  virtual  memory  system, 
which  in  general  requires  processes  to  checkpoint  on  each  interaction.  We  are  interested  in  pursuing  a  more 
optimistic  approach,  reducing  the  number  of  checkpoints  without  reintroducing  overhead  elsewhere. 